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Abstract 

We describe a fast parallel iterative method for computing molecular absorption spectra within TDDFT linear 
response and using the LCAO method. We use a local basis of "dominant products" to parametrize the space of 
orbital products that occur in the LCAO approach. In this basis, the dynamical polarizability is computed iteratively 
within an appropriate Krylov subspace. The iterative procedure uses a a matrix-free GMRES method to determine the 
(interacting) density response. The resulting code is about one order of magnitude faster than our previous full-matrix 
method. This acceleration makes the speed of our TDDFT code comparable with codes based on Casida's equation. 
The implementation of our method uses hybrid MPI and OpenMP parallelization in which load balancing and memory 
access are optimized. To validate our approach and to establish benchmarks, we compute spectra of large molecules 
i-Q on various types of parallel machines. 

The methods developed here are fairly general and we believe they will find useful applications in molecular 
physics/chemistry, even for problems that are beyond TDDFT, such as organic semiconductors, particularly in photo- 
C3 voltaics. 

The manuscript is submitted to Journal of Chemical Theory and Computation, 28.05.2010. 
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1 Introduction 

The standard way to investigate the electronic structure of matter is by measuring its response to external electromagnetic 
fields. To describe the electronic response of molecules, one may use time-dependent density functional theory (TDDFT) 
PP. TDDFT has been particularly successful in the calculation of absorption spectra of finite systems such as atoms, 
molecules and clusters [TJ [5] and for such systems it remains the computationally cheapest ab-initio approach without 
any empirical parameters. 

In the framework of TDDFT [3; , time-dependent Kohn-Sham like equations replace the Kohn-Sham equations of the 
static density- functional theory (DFT). Although these equations can be applied to very general situations [4], we will 
restrict ourselves to the case where the external interaction with light is small. This condition is satisfied in most practical 
applications of spectroscopy and can be treated by the linear response approximation. 

TDDFT focuses on the dependent electron density n(r,t). One assumes the existence of an artificial time dependent 
potential Vks^, t) in which (equally artificial) non interacting reference electrons acquire exactly the same time dependent 
density n(r,t) as the interacting electrons under study. The artificial time dependent potential VKs{r,t) is related to the 
true external potential V cxt (r,t) by the following equation 

V K s(r,t) = V cxt (r,t) + V s (r,t) + V m (r,t). 

Here Va(r,t) is the Coulomb potential of the electronic density n(r,t) and V^ c {r,t) the exchange-correlation potential. 
The exchange-correlation potential absorbs all the non trivial dynamics and in practice it is usually taken from numerical 
studies of the interacting homogeneous electron gas. 

The above time dependent extension of the Kohn-Sham equation was used by Gross et al [5j to find the dynamical 
linear response function \ = sv (' t') °^ an interacting electron gas, a response function that expresses the change of the 
electron density dn(r,t) upon an external perturbation 8V ex t(r',tf). From the change 5n(r,t) of the electron density one 
may calculate the polarization 6P — J 6n(r, t) r d 3 r that is induced by an external field 5V cy ±(r, t). The imaginary part 
of its Fourier transform §P(ui) provides us with the absorption coefficient and the poles of its Fourier transform provide 
information on electronic transitions. 

Most practical implementations of TDDFT in the regime of linear response are based on Casida's equations [6] [7] . 
Casida has derived a Hamiltonian in the space of particle-hole excitations, the eigenstates of which correspond to the 
poles of the interacting response function \{r,r' ,ui). Although Casida's approach has an enormous impact in chemistry 
[7j, it is computationally demanding for large molecules. This is so because Casida's Hamiltonian acts in the space of 
particle-hole excitations, the number of which increases as the square of the number of atoms. 

Alternatively, one may also solve TDDFT linear response by iterative methods. A first example of an iterative method 
for computing molecular spectra is the direct calculation of the density change 5n(r, u>) in a modified Sternheimer approach 
[TJ E]. In this scheme one determines the variation of the Kohn-Sham orbitals 5ip(r, t) due to an external potential without 
using any response functions. By contrast, use of the response functions allowed van Gisbergen et al [HI IS] to develop a 
self-consistent iterative procedure for the variation of the density Sn(r, t) without invoking the variation of the molecular 
orbitals. There exists also an iterative approach based on the density matrix [101 lllj . This approach is quite different 
from ours, but its excellent results, obtained with a plane-wave basis, serve as a useful test of our LCAO based method. 

Over the last few years, the authors of the present paper developed and applied a new basis in the space of products 
that appear in the application of the LCAO method to excited states [T2J EH Q3]. This methodological improvement 
allows for a simplified iterative approach for computing molecular spectra and the present paper describes this approach. 
We believe that the methods developed here will be useful in molecular physics, not only in the context of TDDFT, but 
also for systems such as large organic molecules that, because of their excitonic features, require methods beyond ordinary 
TDDFT. 

This paper is organized as follows. In section [2j we briefly recall the TDDFT linear response equations. In section [3j 
we introduce a basis in the space of products of atomic orbitals. In section HI we describe our iterative method of 
computing the polarizability and in section [5l we explain how the interaction kernels are computed. Section [6] describes 



the parallelization of our iterative method in both multi-thread and multi-process modes. In section [71 we present results 
and benchmarks of the parallel implementation of our code. We conclude in section [HJ 

2 Brief review of linear response in TDDFT 

To find the equations of TDDFT linear response, one starts from the time dependent extension of the Kohn-Sham equation 
that was already mentioned in the introduction and which we rewrite as follows 

V KS ([n],r,t) = V cx t([n},r,t) + Vn([n},r,t) + V xc ([n},r,t). 

The notation [n] indicates that the potentials VkSj Vr and V xc in this equation depend on the distribution of electronic 
charge n(r,t) in all of space and at all times. To find out how the terms of this equation respond to a small variation 
Sn(r, t) of the electron density, we take their variational derivative with respect to the electron density 

SV K s([n],r,t) SV ext {{n],r,t) 5 

Sn(ri,t>) = 5n(r>,t>) + fofrT*) CM"^*) + ^(N,r,*)] . 

The important point to notice here is that — - — and — - — are inverses of xo — t- and x — ~r- that represent the 

on On ' OVks oV cxt 

response of the density of free and interacting electrons to, respectively, variations of the potentials Vks and V ext - As the 

response xo of free electrons is known and since we wish to find the response x °f interacting electrons, we rewrite the 

previous equation compactly as follows 

X -1 = Xo" 1 - /hxc- (1) 

Here an interaction kernel /h xc = -tt- \Va + V xc ] is introduced. Equation ( 1 ) has the form of a Dyson equation that is 

on LI 

familiar from many body perturbation theory [15]. 

To make use of equation (fTl), it remains to specify xo and /h X c- The Kohn-Sham response function xo can be 

computed in terms of orbitals and eigenenergies Q] as will be discussed later in this section (see equation ^ below). For 

the exchange-correlation potential V xc ([n],r,t), we use the adiabatic local density approximation (ALDA) that is local in 

both time and space, V xc ([n],r,t) — V xc (n(r,t)), therefore the interaction kernel /h X c will not depend on frequency 

^ AM 

/hxc = ] 77 + S(r 



r — r'\ dn 

In principle one could determine the interacting response function x( r i t> r \ t') from equation (fTl), determine the variation 
of density 5n(r,t) — /x(r, t, r' , t')8V cx t(f' , t')d 3 r'dt' due to a variation of the external potential 8V cx t and find the 
observable polarization SP = J Sn(r, t)r d 3 r. To realize these operations in practice, we use a basis of localized functions 
to represent the space dependence of the functions Xo( r , r ' , w ) and fnxc{r, r'). 

In order to introduce such a basis into the Petersilka-Gossmann-Gross equation (1TJ), we eliminate the inversions in 
this equation and transform to the frequency domain 

X(r,r',ui) = xo(r,r',u) + / d 3 r 1 d 3 r 2 xo(r,r 1 , u)fB XC (r 1 ,r 2 )x{f2,r', u). (2) 



The density response function of free electrons Xo( r , r ' ■, w) can be expressed [I] in terms of molecular Kohn-Sham orbitals 
as follows 

Xo (r,r ,u = > (n F -n E ) t^ ^r— • (3) 

%-p u-{E-F)+j£ 

Here n F and ipE(r) are occupation factors and Kohn-Sham eigenstates of eigenenergy E, and the constant e regularizes 
the expression, giving rise to a Lorentzian shape of the resonances. The eigenenergies E and F are shifted in such a 
way that occupied and virtual states have, respectively, negative and positive energies. Only pairs E,F of opposite signs, 
E ■ F < 0, contribute in the summation as is appropriate for transitions from occupied to empty states. 
We express molecular orbitals iPe(t) as linear combinations of atomic orbitals (LCAO) 

V^(r)=Afr(r), (4) 

where f a (r) is an atomic orbital. The coefficients X® are determined by diagonalizing the Kohn-Sham Hamiltonian that 
is the output of a prior DFT calculation and we assume these quantities to be available. For the convenience of the 
reader, we use Einstein's convention of summing over repeated indices. 



3 Treatment of excited states within LCAO 

The LCAO method was developed in the early days of quantum mechanics to express molecular orbitals as linear combi- 
nations of atomic orbitals. When inserting the LCAO ansatz Q into equation p| to describe the density response, one 
encounters products of localized functions f a (r)f b (r) — a set of quantities that are known to be linearly dependent. 

There is an extensive literature [THl 03 EH US] on the linear dependence of products of atomic orbitals. Baerends uses 
an auxiliary basis of localized functions to represent the electronic density [13 [2H] ■ His procedure of fitting densities by 
auxiliary functions is essential both for solving Casida's equations and in van Gisbergen's iterative approach. 

In the alternative approach of Beebe and coauthors [16], one forms the overlaps of products (ab\a'b'} to disentangle 
the linear dependence of the products f a (r)f b (r). The difficulty with this approach is its lack of locality and the 0(N 4 ) 
cost of the construction of the overlaps [ST] . 

Our approach is applicable to numerically given atomic orbitals of finite support and, in the special case of products 
on coincident atoms, it resembles an earlier construction by Aryasetiawan and Gunnarsson in the muffin tin context [22] . 
We ensure locality by focusing on the products of atomic orbitals for each overlapping pair of atoms at a time [12l [13] . 
Because our construction is local in space, it requires only O(N) operations. Our procedure removes a substantial part 
of the linear dependence from the set of products {f a (r)f b (r)}. As a result, we find a vertex like identity for the original 
products in terms of certain "dominant products" F x (r) 

/»/V)~ Yl Vx b F X (r). (5) 

A>A min 

Here the notation V£ b alludes to the fact that the vertex V£ b was obtained from an eigenvalue problem for the pair of 
atoms that the orbitals a, b belong to. The condition A > A m i n says that the functions corresponding to the (very many) 
small eigenvalues were discarded. By construction, the vertex V£ b is non zero only for a few orbitals a, b that refer to the 
same pair of atoms that A belongs to and, therefore, V£ b is a sparse object. Empirically, the error in the representation 
(pi) vanishes exponentially fast [33J with the number of eigenvalues that are retained. The convergence is fully controlled 
by choosing the threshold for eigenvalues A m in arL d from now on we assume equality in relation pi). 

We introduce matrix representations of the response functions xo and X m the basis of the dominant products {^(r)} 
as follows _ __ 

Xo (r, r ; , w) = J2 F^r) x %(u)F»(r>); x {r, r', w) = £ F^v) X ^)F v {t'). (6) 

(XV (Ay 

For the non interacting response xLui^i one nas an explicit expression 

x ^)= ^ (n,-n g ) u _ {E _ F) + ie , (7) 

abcd,E,F v ' 

which can be obtained by inserting the LCAO ansatz Q and the vertex ansatz pi) into equation pi). 

We insert the expansions ([Fj]) into the Dyson equation (pi) and obtain the Petersilka-Gossmann-Gross equation in 
matrix form 

Xiiv(u) = X°„M +X° /i '( w )/Hxc Mx^v(w), (8) 

with the kernel /^ c defined as 



C = J d 3 rd 3 r'F^(r)/ Hx c(r,r')^(r'). (9) 

The calculation of this matrix will be discussed in section \E\ 

Since molecules are small compared to the wavelength of light, one may use the dipole approximation and define the 
polarizability tensor 

P lk (uj) = / (fr<Pr'rir' k x{ry,u). 

Moreover, using equation p[) we find an explicit expression for the interacting polarizability tensor Pik(u>) in terms of the 
known matrices x° ancl /hxc 

Pik(u)=dt(- ^^-) xlAu)<, (10) 



X°(w)/h XC/ ^ 



where the dipole moment has been introduced d^ = J f (r)r< d 3 r. Our iterative procedure for computing molecular 



absorption spectra is based on equation (10). Below, we will omit Cartesian indices i and k because we compute tensor 



components Pjfc(oj) independently of each other. 



4 An iterative method for the calculation of the dynamical polarizability 



In equation ( 10 ) the polarizability P(w) is expressed as a certain average of the inverse of the matrix A v = 8" — 
X° , (u})fftxc- A direct inversion of the matrix A^ is straightforward at this point, but it is computationally expensive and 
of unnecessary (machine) precision. Fortunately, iterative methods are available that provide the desired result at much 
lower computational cost and with a sufficient precision. 

In fact, we already used an iterative biorthogonal Lanczos method to create an approximate representation of the 
matrix A£, which can be easily inverted within the Krylov subspace under consideration |13j . In this work, however, we use 



a simpler approach of better computational stability to compute the polarizability (10). First, we calculate an auxiliary 
vector X = A~ 1 x°d by solving a linear system of equations AX = \°d with an iterative method of the Krylov type. 
Second, we compute the dynamical polarizability as a scalar product P = cP X^. In this way we avoid the construction 
of the full and computationally expensive response matrix \ . Below, we give details on this iterative procedure. 

4.1 The GMRES method for the iterative solution of a linear system of equations 

As we explained above, the polarizability P(w) can be computed separately for each frequency, by solving the system of 
linear equations 

(i-x°M/h X c)xh = x VK (ii) 

and by computing the dynamical polarizability as a scalar product P(lu) — d^X^{u). 



We apply a generalized minimal residual method (GMRES) [2HI2S] to solve the linear system of equations ( 11 ), which 
is of the form AX = b. GMRES belongs to the Krylov-type methods [2"il l2l)] that represent a large matrix A in an 
iteratively built up Krylov-type basis |0), |1) . . . |i). The first vector |0) in the basis is chosen equal to |6), while further 
vectors are computed recursively via \i) = A\i — 1). As the vectors \i) — A l \0) are not mutually orthogonal, one may 
enforce their orthogonality by using the Gram-Schmidt method 

\i)=A\i-l)-J2\J)(M\i-l)- (12) 

The orthonormal basis built in this way is used in the GMRES method to approximately solve the linear system of 



equations A\X) — \b) by minimizing the residual \r) — A\X) — \b) within the Krylov-type subspace ( 12 ). The minimization 
of the residual occurs when the equation X^(*l^lj)0l :E ) = (*l^) i s satisfied and this set of equations is of much smaller 
size than the original problem. When the solution in the Krylov subspace (i\x) is found, then an approximate solution in 
the original space can be computed from \X) = J2i \i)(i\ x )- 

A suitable stopping criterion is essential for our method and we tested several criteria in order to keep the number of 
iterations small and achieve a reliable result at the same time. The conventionally used criterion that e r — \r\/\b\ should 
be small is unreliable when the tolerance threshold is comparatively large (e r ~ 1%). Therefore we suggest an alternative 
combined criterion. 

A natural stopping criterion for an iterative solver of the linear system of equations AX = b is a condition on the 
relative error of the solution ex = |AA|/|A|. 

In our case, the quantity of interest is the dynamical polarizability P = (d\X), therefore it is meaningful to impose a 
limit on the relative error of the polarizability ep = |AP|/|P|. Estimations of the errors ex and ep can be easily obtained 
because |AA) = A _1 |r) and AP = (dlA^ 1 ^). We estimate |AA") and AP using a matrix norm [27], AX « |.A -1 ||r) and 

AP « |-A~ 1 |(d|r). We used the Frobenius norm of the Krylov representation of the matrix A~ l , |A _1 | ss \/Ylij K*!^" 1 ^)! 2 

because A is a non Hermitian matrix. 

Both errors ex and ep tend to zero in the limit X — > A cxact , but for a threshold in the range of a few percent they 
behave differently. The condition upon ep is better tailored to the problem and it saves unnecessary iterations in many 
cases. However, in a few cases, the condition upon ep fails, while the condition upon ex works. Therefore, we use a 
condition on ep in "quick and dirty" runs and a combined condition ep < £toiorancc and ex < £toicrancc for reliable results. 

In general, iterative methods of the Krylov type involve only matrix-vector products A\z). For an explicitly given 
matrix A, the operation \z) — > A\z) requires 0(N 2 ) operations. Therefore the whole iterative method will scale as 
0(N 2 Ni tcr ), where A^tcr is the number of iterations until convergence. This is better than direct methods when A^tor *C N, 
because a matrix-matrix multiplication takes 0(N 3 ) operations. 

To avoid matrix multiplications, the application of the matrix A = 1 — x°(u;)/hxc to a vector \z) is done sequentially 
by computing first \z') — /hxc|z) and then A\z) — \z) — \°{uj)\z'). The kernel matrix /h X c is computed before the iterative 
procedure. Because it is frequency independent, it can be easily stored and reused. By contrast, the response matrix 
X°(w) is frequency dependent and computationally expensive and its explicit construction should be avoided. Therefore, 
only matrix-vector products x°( UJ )\ z ) will be computed as explained below without ever calculating the full response 
matrix x°(w). 



4.2 Fast application of the Kohn-Sham response matrix to vectors 

In previous papers [131 E], we have described an 0(N 2 ) construction of the entire response function Xui/( w )j but the 
prefactor was large. Paradoxically, the Krylov method presented above allows for a much faster computation of the 
absorption spectrum, although the cost of matrix-vector products x (w)|z) scales asymptotically as 0(N 3 ) (see below). 
Starting point of our construction of the matrix-vector product x°(o;)|z) is the expression (rZJ) for the Kohn-Sham 
response matrix in the basis of dominant products 



\ jIU 



abcd,E,F 



{X E V^X F ){X F VfX E ) 



(E-F) + ie 



To compute this matrix-vector product efficiently, we decompose its calculation into a sequence of multiplications that 
minimizes the number of arithmetical operations by exploiting the sparsity of the vertex V® b . The sequence we chose is 
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Figure 1: A sequence of operations to compute the matrix-vector product. ^L v z v . 

graphically represented in figure [T] For clarity, the frequency dependent denominator is omitted. Boxes represent the 
products to be performed at different steps. The innermost box contains a frequency independent quantity, that is also 
used in the last step. An algebraic representation of the computational steps is given in table [T] 



Step 


Expression 


Complexity 


Memory 


Details of the Complexity 


1 


n cbl T/cd ViS 


0(N 2 ) 


0(N 2 ) 


■^ * occ^orb prod 


2 


/3 cE = a fv v 


0(N 2 ) 


0{N 2 ) 


I * id -^ * occ^l orb -^ * prod 


3 


a FE = X F (3 cE 


0(N 3 ) 


0(N 2 ) 


NujNoccNyirtNorb 


4 


,,E v^ Y F n FE 
lb = l^F X b a 


0(N 3 ) 


0{N 2 ) 


NujNoccNvirtNorb 


5 


^ = E B a^7f 


0(N 2 ) 


O(N) 


■^ * uj -^ * occTl orb -^ * prod 



Table 1: The complexity and memory requirements during the computation of the matrix-vector product x {w)z. There 
are steps in the sequence with 0(N 2 ) and 0(N 3 ) complexity, where N is the number of atoms. N u denotes the number 
of frequencies for which the polarizability is computed, and the remaining symbols are explained in the text. 



In the first stage of the algorithm, we compute an auxiliary object o. c v 



cE — \icd vE 



by construction. Therefore, we will spend only N . 



hNp ro d operations, where N 



The vertex V£ d is a sparse object 
is the number of occupied orbitals, 



Nprod is the number of products and n or b is the number of orbitals that belong to a pair of atoms. The auxiliary object 
a.1 E is sparse and frequency independent. Therefore we store a^ E and reuse it in the 5-th step of the matrix-vector 



product. The product /3' 



cE 



af^V 



will cost N occ n or bN proi i operations because each product index v "communicates" 



with the atomic orbital index c in one or two atoms. The matrix j3 cE will be full, therefore the product a FE = X F j3 cE will 
cost N occ N virt N or i, i.e. this step has 0(N 3 ) complexity with N virt the number of unoccupied (virtual) states. The next 



step 7;f = J2 F X F a FE also has 0(N 3 ) complexity with the same operation count. Finally, the sum n^ = J2e X a^ r p ab lb 



b„,E 



takes only 0(N ) operations because the vertex V® is sparse. The sequence involves 0(N 3 ) operations, but due to 
prefactors, the run time is dominated by the 0(N 2 ) part of the sequence for molecules of up to a hundred atoms. 



5 Calculation of the interaction kernels 



In the adiabatic local-density approximation (ALDA), the interaction kernel 

, , , SVh , SV„ 

/H + Jxc — -j \- ~? 

on on 

is a frequency-independent matrix. While the representation of the Hartree kernel is straightforward in our basis 

1 



/h = // d s rd s r' F"(r) 



^T^(0, 



(13) 



the exchange-correlation kernel is not known explicitly and must be approximated. The locality of the LDA potential 
V xc (r) = V xc (n(r)) leads to a simple expression for the exchange-correlation kernel 



/&"= d 3 rF»(r)f xc (r)F»(r), 



(14) 

where / xc (r) is a (non linear) function of the density n(r). In this section, we describe how the Hartree and exchange- 
correlation kernels are constructed. 

5.1 The Hartree kernel 



The basis functions F p (r) that appear in the interaction kernels (13) and (14) are built separately for each pair of atoms 



They are either local or bilocal depending on whether the atoms of the pair coincide or not. While the local products are 
spherically symmetric functions, the bilocal products possess only axial symmetry. Because of their axial symmetry, the 
bilocal products have a particularly simple representation in a rotated coordinate system 



Jcutoff 



F"(r)= Y, F ?(r')Y jm (Rr'), r' = r c, 



(15) 



where the rotation R and the center c depend on the atom pair. In the rotated frame, the .Z-axis coincides with the line 
connecting the atoms in the pair. We use radial products F^(r) that are given on a logarithmic grid. The local products 
have a simpler, LCAO-like representation 



F"(r) = F"(|r-c|)Y im (r-c), 



(16) 



where the centers c coincide with the center of the atom. 



Using the algebra of angular momentum, we can get rid of the rotations R in the bilocal basis functions ( 15 ) and 



transform the kernel (13) into a sum over two center Coulomb integrals Cb(l,2) 

Cb(l,2) = // d 3 ri d 3 r 2 g; imi (r 1 -c 1 )|r 1 - r 2 | -1 gi 2m2 (r 2 - c 2 ). 



(17) 



The elementary functions gi m (r) — gi(r)Yi m (r) have a radial-angular decomposition similar to local products (16). The 
Coulomb interaction (17) becomes local in momentum space 



(18) 



cb(i,2) = Jd 3 P g hmi (p)p- 2 J p(ci - C2 hhmM, 

where the Fourier image of an orbital gi m (p) conserves its spherical symmetry 

Sm(p) = igi(p)Y, ro (p). (19) 

The radial part g; (p) is given by the Hankel transform of the original radial orbital in coordinate space 



|(P) 



li(r) ]i(pr)r 2 dr. 



(20) 



Inserting the expression (19) into equation (18), expanding the plane wave e lp( - Cl C2 - 1 in spherical harmonics and using 



the algebra of angular momentum, we can reduce the integration in momentum space to a one-dimensional integral 



l h,h,i 



(R) 



Ship) m(pR) gi 2 (p)dp, 



(21) 



where R= \c\ — c 2 |. 

We distinguish two cases in treating this integral according to whether the orbitals g; m (r) are overlapping or not. 
For overlapping orbitals, we compute the integral (21) numerically using Talman's fast Hankel transform [213 ■ For non 



overlapping orbitals, one can compute the integral (21) exactly using a multipole expansion. To derive this expansion, 
we replace the functions gi(p) in the equation (21) by their Hankel transforms (20) 

q />oo />oo />oo 

l hM {R) = ~ dn&Jnyl dr 2 g h (r 2 )r 2 2 j h (n) hipR) iiM) dp- (22) 

t Jo Jo Jo 




Figure 2: The spatial support of the integrand of the exchange correlation term (14) depends on the support of its 
underlying atomic orbitals and on the geometry of the quadruplet of atoms under consideration. For neighboring atoms, 
the support of the orbitals is several times larger than their inter atomic distances and this results in a nearly spherical 
support of the integrand. The figure illustrates the case of (hypothetical) CO3 



The integral over three spherical Bessel functions h 1 .i 2 ,i{ri,r2 1 R) = L ji t (ri)ji(pR)ih ( r 2) dp reduces to a simple separa- 
ble expression [29 provided two conditions are satisfied, R > r±+r2 (the basis functions do not overlap), and < I < h + h 
(the triangle relation for angular momentum). Under these conditions, we obtain 



lhA 2 ,i(n,r2,R) = Sij 1+ i. 



tt 3 / 2 ^ 2 



r(/ + i/2) 



8 i?'+! r(Zi+3/2)r(/ 2 +3/2)" 



(23) 



Inserting this result into equation (22), we obtain an expression for the Coulomb interaction in terms of moments in 
closed form 

/>oo 

Pi= r 2 drgi(r)r l . 



(24) 



The moments (24) can be computed and stored at the beginning of the calculation. Therefore, the calculation will consist 



of summing the angular-momentum coefficients and this is clearly much faster than a direct numerical integration in 



equation (21) 



The complexity of the near-field interactions will be proportional to the number of atoms N. The calculation of the 



multipoles (24) for the far-field interaction requires of O(N) mathematical operations. The remaining part of the far-field 
interactions (Wigner rotations) scales as TV 2 with the number of atoms. 



5.2 The exchange— correlation kernel 



Unlike the Hartree kernel, the exchange-correlation kernel (14) is local and we therefore compute it directly in coordinate 



space by numerical quadrature. The support of the dominant products F^(r), F u (r) in the integrand of equation (14) 
have, in general, the shape of a lens. Therefore, the support of the integrand will generally be an intersection of two 
lenses. 

We found, however, that integration in spherical coordinates gives sufficiently accurate and quickly convergent results. 
This is because the important matrix elements involve neighboring dominant products and the support of the dominant 
products is large compared to the distance between their centers. Therefore, the integrands of the important matrix 
elements (14) have nearly spherical support. This situation is illustrated by the cartoon in figure [2j 

For each pair of dominant products, we use spherical coordinates that are centered at the midpoint between them. We 
use Gauss-Legendre quadrature for integrating along the radial coordinate and Lebedev quadrature [30] for integrating 
over solid angle 

N r No 

f xc = J2 G *I2 l j /« ( r * ' ^ ) • ( 25 ) 



Here Gi and r» are weights and knots of Gauss-Legendre quadrature, and Lj and Vlj are weights and knots of Lebedev 
quadrature. The number of points N r x Nq can be kept reasonably small (24 x 170 by default). 

The most time consuming part of the exchange-correlation kernel is the electronic density. We found that calculating 
the density using the density matrix 



n(r) = ]T f a (r)D ab f b (r), where D ab = £ X*X, 



(26) 



E<0 



provides a linear scaling of the run time of / xc with the number of atoms. However, a calculation of the density via 
molecular orbitals __ _ 

n(r) = J2 Mr)Mr), where ^ E (r) = ^X a E /» (27) 

E<0 a 

is faster in many cases, although the run time scales quadratically with the number of atoms. In order to optimize the run 
time, rather than insist on linear scaling, we choose the calculational approach automatically depending on the geometry 



of the molecule. For instance, in the case of a long polythiophene with 13 chains (see subsection 7.2) we use the O(N) 
method, while in the other examples it is better to use the 0(N 2 ) approach. 

6 Parallelization of the algorithm 

The overall structure of our algorithm is given in figure[3] First, the basis of dominant products is built, then the interaction 
kernels are computed, and finally both are used in the iterative procedure to compute the dynamical polarizability. 



Construction of the dominant products 

for atom species do 
| Build local dominant products 

!$OMP PARALLEL DO 

for atom pairs (a, b) do 
| Build bilocal dominant products 



Computation 

/xc 

!$OMP PARALLEI 

for each couple 
Build block 


of the interaction kernels 


/h 


and 


j DO 

of atom pairs (a, b; c 
(a, 6; c, d) of kernels. 


d) 


do 







Computation of the dynamical polarizability 

IJOMP PARALLEL DO 

for W G [w bcgin ,W cnd ] do 

Solve for \X): (l - x°M/h*c) I*) - X»M) 
Compute polarizability -P(w) = (d\X) 



Figure 3: Skeleton of the algorithm. Its first (and computationally easiest) part is the construction of the dominant 
products. The second (and computationally most demanding) part is the construction of the interaction kernels. 
The third (and comparatively easy) part is the iterative calculation of the dynamical polarizabilities P(ui). 

The individual components of the algorithm [3] suggest different parallelization strategies. The dominant products are 
built for each atom pair independently, therefore the corresponding code is parallelized over atom pairs. The structure 
of the dominant products suggests a block wise computation of the interaction kernels. These blocks are mutually 
independent, therefore we parallelize their construction. The dynamical polarizabilities are calculated independently for 
each frequency and are, therefore, parallelized over frequencies. Below, we go through the details of the algorithm and 
its hybrid OpenMP/MPI parallelization. 

6.1 Multi-thread parallelization 

Modern computers are faster than previous generations of machines mainly due to their parallel design as in the case 
of general-purpose multi-core processors. For specially written programs, such a design allows to run several tasks 
or "threads" simultaneously. Fortunately, it is easy to write a multi-threaded program using the current application 
programming interface OpenMP. Moreover, in OpenMP, data exchange between threads uses the common memory and it 
is, therefore, faster than on distributed-memory machines. For these reasons, our main emphasis here is on multi-threaded 
(or shared-memory) parallelization. We use the OpenMP standard [31] that allows for an efficient parallelization of all 
three sections of the algorithm [3| 

6.1.1 Building the basis of dominant products 

The construction of local dominant products involves only the atomic species that occur in a molecule. Therefore it is 
computationally cheap and any parallelization would only slow down their construction. 

For bilocal dominant products, the situation is different. The construction of bilocal dominant products is done for 
all atom pairs, the orbitals of which overlap. Because these pairs are independent of each other, we parallelize the loop 
over pairs with OpenMP directives. 



The dominant products F^(r) and vertices V^ b are stored in suitably chosen data structures to allow for effective use 
of memory. Due to the locality of our construction, the amount of memory spent in the storage of dominant products 
and vertices grows linearly with the number of atoms. 



6.1.2 Construction of the interaction kernels 



The interaction kernels ( 13 ) and ( 14 ) refer to a pair of dominant products F^ (r) and F v (r ) . Because each of the dominant 
products in turn refers to a pair of atoms, the interaction kernel splits into blocks that are labeled by a quadruplet of atoms 
(a, b; c, d). In practice, this is used to precompute and reuse some auxiliary quantities that belong to such a quadruplet. 
The block structure is schematically depicted in figure [4] Generically, a block is rectangular, but when two pairs coincide, 
the block reduces to a triangle because of reflection symmetry. 
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Figure 4: The block structure of the interaction kernels. The blocks are defined by two atom pairs (a, b) and (c, d). The 
size of blocks is known from the construction of the dominant products. Because the entire interaction kernel /hxc * s 
symmetric, only its upper triangular part is computed. 

The calculation of the interaction kernels is parallelized over blocks using dynamical scheduling. Because the compu- 
tational load of each block is proportional to its area, we minimize the waiting time at the end of the loop by a descending 
sort of the blocks according to their area. 



6.1.3 Computation of the dynamical polarizability 



According to equation (10 1, the dynamical polarizability Pik(uj) is computed independently for each frequency w. There- 
fore, the loop over frequencies in the algorithm [3] is embarrassingly parallel. 

In the parallelization over frequencies, we had to make thread safe copies of module variables (working arrays in the 
GMRES solver) using the OpenMP directive threadprivate. This multiplies the memory requirement by the number of 
threads, but this poses no problem, because this part of the algorithm is not memory intensive. 



The number of iterations to reach convergence of the polarizability tensor ( 10 ) varies with frequency and with the 



Cartesian components in an irregular way. To take this into account, we use a dynamic schedule with a single frequency 
and tensor component per thread. By treating tensor components on the same level as frequencies, we reduce the body 
of a loop by a factor of three provided only diagonal components are computed. 



6.2 Hybrid MPI-thread parallelization 

Most current supercomputers are organized as clusters of multi-core nodes that are interconnected by a high speed 
network. Although the number of cores grows over the years, we still need several nodes for greater computational speed 
and to provide sufficient memory. 

Our program has also been adapted to such distributed- memory parallel machines. We parallelize according to 
the Single Program Multiple Data (SPMD) paradigm with the message passing interface (MPI) to speed up only the 
computationally intensive parts of the algorithm — the construction of the bilocal products, the calculation of the kernels 
/hj /xc and the iterative procedure. Moreover, each MPI process uses the multi-thread parallelism described above. 
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6.2.1 Parallelization of the basis of dominant products 

Only bilocal dominant products must be constructed in parallel. As described in subsection |6.1.1| the construction is 
naturally parallel in terms of atom pairs. Therefore, we distribute the atom pairs prior to computation and gather data 
after the computation in order to duplicate basis functions on each MPI process. 

6.2.2 Parallelization of the interaction kernels 

As explained in subsection |6.1.2[ the interaction kernel depends on quadruplets of atoms and it would appear natural to 
parallelize their construction over these quadruplets. We found it advantageous, however, to slice the interaction matrix 
into vertical bands that belong to one or more atom pairs and process them on different nodes. After the computation, 
the complete matrix of interactions is reconstituted on each node. 

The optimal size of each vertical band is determined by an estimate of the work load prior to the computation. Since 
the Hartree and the exchange-correlation kernel differ in their properties (such as locality) their matrices are distributed 
differently. In both cases, however, the total work load is the sum of the work loads of its constituent quadruplets. 



In the case of the Hartree kernel (13), the work load of a block depends on its size and on whether its atom pairs 



are local or bilocal. While the local products are of simple LCAO type (16 1, the bilocal products (15) contain additional 



summations. We found the following robust estimator of the workload of a block of the Hartree kernel 

Workload(Hartree) = Size_Of_Block • (j cuto ff • ©(o i= b) + 1) • (j cuto s ■ ©(c ^ d) + 1), (28) 

where ©(a ^ b) is equal to one for bilocal atom pairs and zero otherwise, Jcutoff is the largest angular momentum in the 



expansion (15). By default, its upper limit is set to 7. 



In the exchange-correlation kernel ( 14 1, the domain of integration is the intersection of two lens-like regions. Because 
there is no simple analytical expression for such a volume, we count the number of integration points within the support 
for each block and estimate the work load as proportional to the number of points. Rather surprisingly, the run time is 
independent of the dimension of the block (this is due to precomputing the values of the dominant products) . 

Because the kernel /h xc is frequency independent, it is computed at the beginning of the iterative procedure and dupli- 
cated on all the MPI-nodes. We found the time for gathering the matrix small compared to the time of its computation. 

6.2.3 Parallelization of the iterative procedure 



As mentioned previously in subsection 6.1.3 the iterative calculation of the polarizability tensor Pjfe(w) is naturally 
parallel both in its frequency and in its Cartesian tensor indices. Therefore, this part of the algorithm is parallelized over 
both frequency and tensor indices. However, the workload for each composite index is difficult to predict. To achieve a 
balanced workload on the average, we distribute this index cyclically over MPI nodes ("round robin distribution"). 

7 Results 

In this section, we present different absorption spectra of large molecules to validate the method and the parallelization 
approach. We start in subsection |7.1| by comparing our absorption spectra with previous calculations by other authors 
and also with experimental data. Then, in subsection |7.2[ we present the scaling of the run time with the number of 
atoms. In subsection |7.3[ we examine the efficiency of the hybrid parallelization for a variety of molecules run on different 
machines. Finally, in subsection |7.4| we present a comparison of absorption spectra of fullerene Cgo with its derivative 
PCBM that is often used in organic solar cells. 

7.1 Fullerene C 60 

We already tested the basis of dominant products in our previous works [T^l EH HI] , where the absorption spectra of 
methane, benzene, indigo blue, and buckminsterfullerene Cgo were studied. The new element in the present paper is 
the use of an iterative method without constructing the full Kohn-Sham response response function. We now test this 
method on buckminster fullerene and its functionalizcd derivative PCBM. 

Buckminster fullerene C@o is of considerable interest in materials science. Among other applications, it is used as 
an electron acceptor in organic solar cells |32) . Here we compare our results with spectra from the Quantum Espresso 
package and with experiment. 

The absorption spectrum of fullerene Cgo is shown in figure [5] where we compare our result with the calculation by 
Rocca et al. [10] and with experiment |33j . One can see a good agreement between the theoretical spectra, while both 
theoretical predictions deviate from the experimental data by 0.2 . . . 0.3 eV. The shift in the spectrum might be due to 
the solvent in the experimental setup or due to the inadequacy of the simplest LDA functional for this large molecule. 
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Figure 5: Comparison of the low-frequency absorption spectra of Cgo fullerene. Wc see a good agreement between the 
two theoretical predictions, while the experimental curve is shifted by 0.2 . . . 0.3 eV. 




Figure 6: The geometry of the longest polythiophene chain we considered. 



We used DFT data from the SIESTA package [34], where the pseudo-potentials of the Troullier-Martin type and the LDA 
functional by Perdew and Zunger |35| are applied. A double-zeta polarized basis set has been used and the broadening 
of levels has been set to 0.019 Ry. Our program spent a total of about 62.5 minutes on this calculation, of which 2098 
seconds were spent on the Coulomb kernel, 1085 seconds on the exchange-correlation kernel, and 500 seconds in the 
iterative procedure, i. e. approximately 2.27 second per frequency. The convergence parameter for the polarizability was 
set to 1%. With this convergence parameter, the dimension of the Krylov space was varying from 7 to 12 with an average 
of 8, while the dimension of the dominant product space was 8700. In this test, we used one thread on a machine with 
two CPU Intel quad core Nehalem @ 2.93GHz; 8MB cache; 48GB DDR3 RAM, and consuming no more than 2.3% of 
RAM (1.2% during the iterative procedure). 

7.2 Polythiophene chains (complexity of the method) 

In sections [4] and [5| we discussed the complexity scaling of different parts of the algorithm theoretically. In this subsection, 
we measure the dependence of run time on the number of atoms N in polythiophene chains of different lengths. We shall 
see that the run time scaling follows the theoretical predictions for the complexity. 

Sulfur containing molecules are widely use in organic electronics [3SI3S1EZ]- In this work, we study pure polythiophene 
chains of 3 to 13 repeating units. The geometry of the longest polythiophene we considered is shown in figure [6] Our 
calculations suggest (ignoring the excitonic character of these molecules) that the HOMO-LUMO energy difference 
decreases, while the absorption increases, with chain length. The calculated absorption spectra are collected in figure [7] 

We now use the calculations on polythiophene spectra in order to study the run time scaling with the number of atoms 
of different parts of our algorithm. Their scaling behavior will be described in terms of approximate scaling exponents. 
The run times for a few chains are collected in table [2] for a machine of four CPU AMD Dual-Core Opteron @ 2.6GHz; 
8MB cache; 32GB DDR2 RAM, and running sequentially. 

The application of the non interacting response to a vector consists of N 2 - and A 3 -parts (see subsection 4.2 1. The 



total time for the product \° z is collected in the third column of table [2J while the run time of the A 3 -part is collected 
in the fourth column. Using the run times, one can compute exponents x and X3 for their corresponding scaling laws N x 
and N Xa . The exponents x and X3 vary in the range of x = 2.31 . . . 2.36 and x% = 2.49 . . . 2.53, respectively. The run time 
of the Hartree kernel (fifth column) shows a scaling exponent in the range ih = 2.05 . . . 2.06, while the run time in the 
exchange-correlation kernel (sixth column) scales almost linearly x xc = 1.04. . . 1.12. Therefore, the measured exponents 
are close to the predicted exponents x = 3, 23 = 3, ih = 2, and a; xc = 1. 

The calculation of the Hartree kernel /r- via multipoles, as explained in subsection |5.1| improves the run time in the 
case of large molecules, but could not improve the run time scaling of the Coulomb interaction. In fact, the Hartree kernel 
/h is a non-local quantity and the rotations involved in the bilocal dominant products contribute a substantial part to 
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Figure 7: Comparison of low-frequency spectra for several polythiophene chains. 
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Table 2: Run time in different parts of the algorithm as a function of the number of atoms TV in the polythiophene 
chain. The third column gives the total time for the matrix-vector product (x°z), while an A 3 -part that arises in the 
construction of x° z is given in the fourth column. The fifth and sixth columns display the run times of the interaction 
kernels /h and / xc , respectively. 

the run time. 

The scaling of the run time for the entire calculation of molecular spectra will also vary with the parameters of 
the calculation. Obviously, for a small number of frequencies the scaling of total run time will be determined by the 
Hartree and exchange-correlation kernels. However, if the number of frequencies is large, then the application of the non 
interacting response \° on a vector will dominate the run time. 

7.3 Quality of the parallelization 

Our parallel Fortran 90 code is adapted to current parallel architectures (see section [6]). In this subsection, we evaluate the 
quality of the hybrid parallelization by testing our approach on three machines of different architecture. Two machines 
belong to shared-memory multi-core architectures with non uniform memory access, while the third machine is a cluster 
with 50 multi-core nodes interconnected by an Infiniband network. 

The program was compiled using Intel's Fortran compiler and linked against Intel's Math Kernel Library for BLAS, 
LAPACK and Fast Fourier Transform libraries. 



7.3.1 Multi-thread parallelization 

We consider parallelization on shared-memory machines as more important than on distributed- memory machines. There- 
fore, the speed up of our code is first tested on two shared-memory machines with Non-Uniform Memory Architecture 
(NUMA). The first machine has two quad core Nehalem 5500 CPUs (see figure [8]), while the second machine has 48 dual 
core Xeon CPUs (see figure 10 1. 



The speedup on the Nehalem machine is shown in figure [9] for three molecules of different size: benzene, indigo and 
fullerene. The speedup in both kernels and in the generation of bilocal products is good for all molecules, therefore we 
plot only the speedup for the Hartree kernel on the left panel. By contrast, the iterative procedure exhibits a lower 
speedup (as shown on the right panel in figure^). This is due to the high memory-bandwidth requirement of the iterative 
procedure. The high memory-bandwidth is clearly revealed in two distinct ways of using the same number of cores in a 
test calculation, see subsection |7. 3. 2| 

In spite of the loss of speed up for larger molecules on the Nehalem machine, we tested our implementation also on a 



parallel processor with a very large number of cores (see figure 10). The speedup is shown in figure 11 
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Figure 8: Memory/cache/core structure of Nehalem Intel Xeon X5550 machine. Two quad core nodes have a fast access 
to one of the memory banks, while the inter node communication is slower. 
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Figure 9: The left panel shows the speedup factor in the Hartree kernel for molecules of different size. The speedup in the 
exchange-correlation kernel and in the generation of dominant products is similar to the left panel. The right panel shows 
the speedup factor in the iterative procedure. We observe that the speed up decreases with the size of the molecule. 

Our results show a satisfactory speedup in both interaction kernels, while the iterative procedure again shows a poorer 
performance due its high memory bandwidth requirements, which cannot be satisfied in this NUMA architecture. 



7.3.2 Hybrid MPI-thread parallelization 

Prior to large scale computations with many Nehalem (see figure [8]) nodes, we performed test runs on one node to find an 
optimal OpenMP/MPI splitting. Four calculations were done on the Nehalem machine using all 8 cores of the machine, 
but differing in the OpenMP/MPI splitting. The results are collected in the table [3] We observed a small workload 
unbalance of 8% in case of the Hartree kernel and an even smaller unbalance of 5% in the case of the exchange-correlation 
kernel. The iterative procedure with the "round robin distribution" of frequencies over the nodes shows a even lower 
workload unbalance of 2%. The best run time is achieved in a 2/4 hybrid parallelization i.e. running 2 processes with 4 
cores each. This optimal configuration reflects the structure of the machine, where each thread shares the same L3 cache. 
The node has two processors of four cores each and a relatively slower memory access between the processors. This weak 
inter node communication results in an appreciable penalty in an OpenMP-only run (1/8 configuration), because the 
iterative procedure reads a rather large amount of data (VJf Jfjf ) twice during the application of the response function 



Xo (see subsection 4.2) 
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Figure 10: Memory/cache/core structure of Xeon-96 machine. 48 dual core Xeon CPUs are connected to four memory 
banks. Although every core can address the whole memory space, the inter node communication is slower. 
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Figure 11: Speedup on a heavily parallel Xeon-96 machine. The results are satisfactory for the interaction kernels, while 
the memory-bandwidth requirements of the iterative procedure hamper the parallel performance in the case of the larger 
fullcrcnc Cqq molecule. 



Proc / Thr 


Domi. prod. 


/h 


Jxc 


Iterative proc. 


Total 


1/8 


7.3 


271 


142 


145 


571 


2/4 


6.9 (6.9) 


273 (267) 


142 (141) 


109 (108) 


538 


4/2 


6.8 (6.8) 


274 (264) 


142 (141) 


122 (112) 


544 


8/1 


6.8 (6.8) 


274 (257) 


143 (140) 


134 (120) 


570 



Table 3: Run time and speedup factors in a hybrid MPI/OpenMP parallelization for fullerene Ceo- 
smallest run time between the nodes is stated in order to estimate the MPI work load disbalance. 



In the brackets, the 



The high memory-bandwidth requirement is clearly revealed in two distinct ways of using the same number of cores (8 
cores), either using all cores on one node or distributing them over two nodes. In the latter case, the memory-bandwidth 
is higher and the iterative procedure runs considerably faster (92 seconds versus 137 seconds in the case of fullerene Cgo)- 

We used the above optimal 2/4 hybrid configuration in a massively parallel calculation on the chlorophyll-a molecule. 
The speedup due to hybrid OpenMP-MPI parallelization is shown in figure [12] In this computation, we used up to 
50 nodes of recent generation Nehalem machines. According to the previous experiment on fullerene Ceo, we started 2 
processes per node, each process running with four threads. The two processes were placed on sockets. One can see that 
the iterative procedure shows the best speedup among other parts of the code, while total run time is governed mainly 
by the calculation of the exchange-correlation kernel. The absolute run times (including communication time) in the first 
calculation with one node (2 processes) were: total 4003 seconds, for the exchange-correlation kernel 1147 seconds, for 
the Hartree kernel 1247 seconds, for the iterative procedure 1574 seconds, and for the bilocal vertex 11.8 seconds. The 
speedup in the bilocal vertex reaches a maximum at 10 nodes because of increasing communication time. 
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Figure 12: Speedup due to hybrid OpenMP/MPI parallelization for chlorophyll-a. The job was run on up to 50 nodes 
with 2 processes per node. The code shows a linear speedup on up to 15 nodes (30 processes, 120 cores). Further increase 
of the number of nodes results in a steady acceleration of the whole program. 



Total runtime 

Exchange-correlation kernel 
Hartree kernel 
Iterative procedure 
Bilocal vertex 




The starting geometry of the molecule was taken from Sundholm's supplementary data |38j . The geometry was further 
relaxed in the SIESTA package [M] using Broyden's algorithm until the remaining force was less than 0.04 Ry/A. The 
relaxed geometry is shown in figure 13 In order to achieve this (default) criterion, we had to use a finer internal mesh 
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with a MeshCutof f of 185 Ry. The default DZP basis was used, but to achieve convergence, we used orbitals that are more 
extended in space than SIESTA's default orbitals. The spatial extension is governed by the parameter PAO.EnergyShift 
that was set to 0.002 Ry in the present calculation. The spectrum of the chlorophyll-a molecule is seen in figure [14] 
Like in the case of fullerene Ceo, there is excellent agreements between theoretical results that however differ from the 
experimental data |39j . The low frequency spectrum consists of two bands at 635 nm and 450 nm. Both bands are due 
to transitions in the porphyrin. According to our calculation, the first A band consist of 2 transitions between HOMO- 
LUMO and HOMO-1-LUMO, while the second (so called Soret) band consist of 6 transitions to LUMO and LUMO+1 
states. 




Figure 13: The relaxed geometry of chlorophyll-a molecule obtained with the SIESTA package using a DZP basis set. 
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Figure 14: Low frequency absorption spectrum of chlorophyll-a. 



7.4 Fullerene C 60 versus PCBM 

Fullerenes are often modified in order to tune their absorption spectra or their transport properties [3U 001 HI] . In this 
work, we compute the absorption spectra of [6,6]-phenyl C61 butyric acid methyl ester (PCBM) and compare with the 
spectrum of pure fullerene Ceo- We use the same parameters as in the case of Ceo in subsection |7.1| A relaxed geometry 
of PCBM was obtained using the SIESTA package [34] and using its default convergence criterion (maximal force less 
than 0.04 eV/A). Figure 
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shows the relaxed geometry. The absorption spectrum of PCBM is shown in figures 16 



and 

Figure U& shows a comparison of our calculation with recent experimental results [41] . We can see that our results 
have similar features as the experimental data: the maxima at 350 nm agree well with a broad experimental resonance at 
355 nm, and a substantial "background" at longer wavelength is present both in the calculated and in the experimental 
spectrum. In this calculation, we set the damping constant e = 0.08 Ry and compute the spectrum in the range where 
experimental data are available. However, in order to better understand the difference introduced by the functional group, 
we compute the spectra in a broader range of energies with a smaller value of the damping constant e = 0.003 Ry. The 



result is shown in figure 17 



One can sec on the left panel of figure 17 that PCBM absorbs much stronger in the visible range. This is a consequence 
of symmetry breaking and indicates a modified HOMO-LUMO gap. On the right panel of the figure, one can recognize the 
main difference between pure and modified fullerene. The high spatial symmetry of pure fullerene leads to a degeneracy 
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Figure 15: Geometry of PCBM. Relaxation is done in SIESTA package with Broyden's algorithm. 
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Figure 16: Comparison of the low-frequency spectra for PCBM with experimental data. 




Figure 17: Comparison of the low-frequency spectra Cgo versus PCBM. 
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of the electronic transitions and several transition contribute to the same resonance. The symmetry is broken in the case 
of PCBM, the degeneracy is lifted and the spectral weight is spread out. 

8 Conclusion 

In this paper, we have described a new iterative algorithm for computing molecular spectra. The method has two key 
ingredients. One is a previously constructed local basis in the space of products of atomic LCAO orbitals. The second is 
the computation of the density response not in the entire space of products, but in an appropriate Krylov subspace. 

The speed of our code is roughly comparable to TDDFT codes in commercially available software but the reader must 
understand that we cannot give any details on this touchy issue. 

The algorithm was parallelized and was shown to be suitable for treating molecules of more than hundred atoms on 
large current heterogeneous architectures using the OpenMP/MPI framework. 

Our approach leaves plenty of room for further improvements both in the method and in the algorithm. For example, 
we did not consider reducing the dimension of the space in which the response function acts, but such a reduction is 
feasible. 

Also, we chose a uniform mesh on the frequency axis while a more economical, adaptive choice is possible. We are 
working on an adaptive procedure to obtain good spectra with few frequency points and also to compute the position of 
the poles and strength of their residues. A reduction in the number of frequencies will allow to avoid the full calculation 
of the interaction kernels, replacing them by matrix-vector multiplications. There exist fast multipole methods |42| for 
computing fast matrix-vector products of the Hartree interaction. 

Moreover, for large molecules, our embarrassingly parallel approach to compute spectra induces a memory-bandwidth 
bottleneck. To avoid it, one may parallelize the frequency loop using MPI and parallelize the matrix- vector operations 
using OpenMP. 

More generally, because we do not use Casida's equations, the methods developed here should be useful beyond the 
TDDFT approach, for instance in the context of Hedin's GW approximation [33] where Casida's approach is no longer 
available. 
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